message = F and warning = F are useful for your libraries chunk.
Do not include install.packages() code in your submission. Package installs can be done in the console and should not be run every time you knit.
From the Course Outline, the prerequisites for this class are: STATS 220 and STATS 210 or 225 and 15 points from ECON 221, STATS 201, 208, or ENGSCI 314.
Choose one of these courses (presumably one you’ve taken) and write one multichoice question on a topic from that course. The audience should be your peers in this course who are aiming to revise topics from the prerequisites. If I have time, I will make some of these into a practice quiz on Canvas. You can just write a note if you don’t want your question considered for inclusion; it won’t affect your mark.
You will be marked on the correctness and quality of your question and explanation. The question does not have to be hard, per se, but should be USEFUL to you and your fellow students.
A company called Black Saber¹ has been trialling a new AI recruitment
pipeline manager for their Data and Software teams. There are three
phases, outlined below, each narrowing down the field of applicants.
Based on advice from their legal team, they are not able to provide you
with the original application data, but they can provide these
anonymised indicators/ratings from each phase. applicant_id
is consistent across phases.
phase1-new-grad-applicants-2022.csv
In the first phase of the hiring pipeline applicants complete a form and are asked to submit a CV and cover letter. Extracurriculars and internship experience are auto-rated based on the descriptions applicants provide in the application form.
| Variable | Description |
|---|---|
| applicant_id | A unique ID assigned to applicants in Phase 1 |
| team_applied_for | Software or Data |
| cover_letter | 0 if absent, 1 if present |
| cv | 0 if absent, 1 if present |
| gpa | 0.0 to 4.0 (American style) |
| gender | Gender of employee: ‘Man’, ‘Woman’, ‘Prefer not to say’ (the only options provided) |
| extracurriculars | The description of extracurricular involvement is assessed against a proprietary key term and phrase bank and given a 0, 1 or 2, where 2 indicates several high relevance and/or skills building extracurriculars, 1 indicates some relevant and/or skills building extracurriculars, and 0 indicates no extracurriculars described or that those described were not rated as high relevance or high skills building |
| work_experience | Similar to extracurriculars, the description applicants provided is assessed against a proprietary key term and phrase bank, that also considers company names and reputations, to give a 0, 1 or 2 score, with 2 being the best, 0 the worst |
phase2-new-grad-applicants-2022.csv
We don’t know exactly how these are being assessed by the AI; the algorithm is commercially sensitive, but their demonstrations of the system were impressive.
| Variable | Description |
|---|---|
| applicant_id | A unique ID assigned to applicants in Phase 1 |
| technical_skills | Score from 0 to 100 on a timed technical task, AI autograded |
| writing_skills | Score from 0 to 100 on a timed writing task, AI autograded |
| speaking_skills | A rating of speaking ability based on pre-recorded video, AI autograded |
| leadership_presence | A rating of ‘leadership presence’ based on pre-recorded video, AI autograded |
phase3-new-grad-applicants-2022.csv
This is the information from the interview phase. Being listed as ‘first’ or ‘second’ interviewer is arbitrary, and who the interviewers were is not available from our tracking system. Applicant IDs are listed across the top and then the two scores for the applicant are listed below their ID.
The average score of the two interviewers was used to determine final hires.
final-hires-newgrad_2022.csv
This data set contains the applicant IDs of everyone who was sent an offer letter. In this cohort, everyone accepted.
| Variable | Description |
|---|---|
| applicant_id | A unique ID assigned to applicants in Phase 1 |
head()) of your finished dataset’s first 10
rows (do not print the whole thing!) [4 marks]
# Load the tidyverse for read_csv(), pivot_longer(), and the dplyr verbs
library(tidyverse)

# Read in each round and final hires
first_round <- read_csv("data/2022-phase1-new-grad-applicants.csv")
second_round <- read_csv("data/2022-phase2-new-grad-applicants.csv") |>
  mutate(passed1 = 1)
third_round <- read_csv("data/2022-phase3-new-grad-applicants.csv")
final_hires <- read_csv("data/2022-final-hires-newgrad.csv")

# Third round needs to be made tidy and the average scores calculated
third_tidy <- third_round |>
  pivot_longer(-applicant_id) |>
  rename(interview = "applicant_id", applicant_id = "name", score = "value") |>
  group_by(applicant_id) |>
  summarise(interview_score = mean(score)) |>
  # Make ID numeric to join with the others
  mutate(applicant_id = as.numeric(applicant_id)) |>
  mutate(passed2 = 1)

# Must add another variable to the final_hires dataset to indicate being a final hire
final_hires <- final_hires |>
  mutate(hired = "hired") |>
  mutate(passed3 = 1)

# Join all cleaned datasets
all_data <- first_round |>
  left_join(second_round) |>
  left_join(third_tidy) |>
  left_join(final_hires) |>
  # Optional organising
  select(applicant_id, starts_with("pass"), everything())

head(all_data)
#> # A tibble: 6 × 17
#> applicant_id passed1 passed2 passed3 team_applied_for cover_letter cv gpa
#> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 1010 NA NA NA Software 0 1 1.3
#> 2 1020 NA NA NA Software 0 1 3.4
#> 3 1030 1 NA NA Data 1 1 2.4
#> 4 1040 NA NA NA Software 0 1 2.7
#> 5 1050 NA NA NA Data 1 0 2.1
#> 6 1060 NA NA NA Software 0 1 2.6
#> # ℹ 9 more variables: gender <chr>, extracurriculars <dbl>,
#> # work_experience <dbl>, technical_skills <dbl>, writing_skills <dbl>,
#> # leadership_presence <dbl>, speaking_skills <dbl>, interview_score <dbl>,
#> # hired <chr>
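One quick sanity check on a joined dataset like this is to count how many applicants carry each `passed` flag; the counts should shrink from phase to phase. The sketch below uses a small invented tibble (`toy`, with made-up IDs and flags) rather than the Black Saber data, and the column name `applied` is only an illustrative choice.

```r
library(dplyr)

# Invented toy data with the same flag pattern as above:
# 1 if the applicant reached the next phase, NA otherwise
toy <- tibble(
  applicant_id = c(1010, 1020, 1030, 1040, 1050),
  passed1 = c(1, 1, 1, NA, NA),
  passed2 = c(1, 1, NA, NA, NA),
  passed3 = c(1, NA, NA, NA, NA)
)

# Count all applicants, then the non-missing flags for each phase
funnel <- toy |>
  summarise(
    applied = n(),
    across(starts_with("passed"), ~ sum(!is.na(.x)))
  )

print(funnel)
```

If a later count were larger than an earlier one, that would suggest a join had duplicated or mismatched applicant IDs.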
# A basic anova suggests that mean speaking skills ratings are not the same for all gender groups
anova(lm(all_data$speaking_skills ~ all_data$gender))
#> Analysis of Variance Table
#>
#> Response: all_data$speaking_skills
#> Df Sum Sq Mean Sq F value Pr(>F)
#> all_data$gender 2 250.1 125.049 25.346 6.869e-11 ***
#> Residuals 297 1465.3 4.934
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# A basic anova suggests that mean leadership presence ratings are not the same for all gender groups
anova(lm(all_data$leadership_presence ~ all_data$gender))
#> Analysis of Variance Table
#>
#> Response: all_data$leadership_presence
#> Df Sum Sq Mean Sq F value Pr(>F)
#> all_data$gender 2 54.74 27.3712 5.1122 0.006564 **
#> Residuals 297 1590.17 5.3541
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
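An F test only says that the group means differ somewhere, so it can help to look at the group means alongside it. The sketch below is one possible way to do that on invented toy data (`toy`, with made-up ratings, not the Black Saber data). It also passes `data =` to `lm()`, which labels the term as `gender` rather than `all_data$gender` in the ANOVA table.

```r
# Toy data, invented purely for illustration
toy <- data.frame(
  gender = rep(c("Man", "Woman", "Prefer not to say"), each = 4),
  speaking_skills = c(7, 8, 6, 7, 4, 5, 5, 4, 6, 5, 7, 6)
)

# Mean rating per gender group, for context alongside the F test
group_means <- aggregate(speaking_skills ~ gender, data = toy, FUN = mean)
print(group_means)

# One-way ANOVA using `data =` so terms are labelled by variable name
fit <- lm(speaking_skills ~ gender, data = toy)
anova_table <- anova(fit)
print(anova_table)
```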
Create the UGLIEST version of the graph you created above that you can. Dancing T-rex as the background or colours that make your eyes bleed? Go for it! Note: You are free to use any functions you can find; you don’t have to stick as closely to course content.
Consider the following code. It scrapes the text of a poem about statistics from a webpage.⁴
#> [1] "\n In this vast age of data's endless stream, A science blooms with wonders to behold, Where bits and bytes converge in seamless theme, Unveiling truths that were once left untold.\n\nWith algorithms and models as our guide,\nWe journey through the realms of structured lore,\nEach data point a star to be untied,\nTo find the patterns hidden deep in core.\nThrough clustering, we sort and classify,\nRegression leads us to predictive might,\nIn neural networks, connections amplify,\nEmerging knowledge, dazzling and bright.\nOh, data science, thou art a beacon rare,\nIlluminating paths to futures fair.\n\n Written by ChatGPT - The AI Poet | 2023\nI would like to make it clear that I take no responsibility for any crimes against poetry committed here. - Liza\n\n"
#> [1] "Inthisvastageofdata'sendlessstream,\nAsciencebloomswithwonderstobehold,\nWherebitsandbytesconvergeinseamlesstheme,\nUnveilingtruthsthatwereonceleftuntold.\n\nWith algorithms and models as our guide,\n\nWe journey through the realms of structured lore,\n\nEach data point a star to be untied,\n\nTo find the patterns hidden deep in core.\n\n\n\n\nThrough clustering, we sort and classify,\n\nRegression leads us to predictive might,\n\nIn neural networks, connections amplify,\n\nEmerging knowledge, dazzling and bright.\n\n\n\n\nOh, data science, thou art a beacon rare,\n\nIlluminating paths to futures fair.\n\n\n\n\nWritten by ChatGPT - The AI Poet | 2023\n\n\n\n\nI would like to make it clear that I take no responsibility for any crimes against poetry committed here. - Liza"
This works as expected when scraping using html_text,
but has a problem when using html_text2. These are both
functions from the rvest package (Wickham 2022) and html_text2
provides more nicely formatted outputs, i.e., according to the help
text: “html_text2() simulates how text looks in a browser, using an
approach inspired by JavaScript’s innerText(). Roughly speaking, it
converts <br /> to "\n", adds blank lines around <p> tags,
and lightly formats tabular data.”
Our problem is that when using html_text2, some of the
spaces are dropped and the words are all smushed together as part of
this reformatting.
Suppose three students have each created an example to report this
potential bug to the rvest development team. Using the
article on reprex dos and don’ts (Bryan et al.
2022) and broader information about the reprex philosophy, choose
THREE things to compare and contrast these three samples on.
Note: You do NOT need to be able to read the HTML to answer this question.
library(rvest)
html_text(read_html("https://link.lizabolton.com/a_scrapable_poem.html"))
#> [1] "\n In this vast age of data's endless stream, A science blooms with wonders to behold, Where bits and bytes converge in seamless theme, Unveiling truths that were once left untold.\n\nWith algorithms and models as our guide,\nWe journey through the realms of structured lore,\nEach data point a star to be untied,\nTo find the patterns hidden deep in core.\nThrough clustering, we sort and classify,\nRegression leads us to predictive might,\nIn neural networks, connections amplify,\nEmerging knowledge, dazzling and bright.\nOh, data science, thou art a beacon rare,\nIlluminating paths to futures fair.\n\n Written by ChatGPT - The AI Poet | 2023\nI would like to make it clear that I take no responsibility for any crimes against poetry committed here. - Liza\n\n"
html_text2(read_html("https://link.lizabolton.com/a_scrapable_poem.html"))
#> [1] "Inthisvastageofdata'sendlessstream,\nAsciencebloomswithwonderstobehold,\nWherebitsandbytesconvergeinseamlesstheme,\nUnveilingtruthsthatwereonceleftuntold.\n\nWith algorithms and models as our guide,\n\nWe journey through the realms of structured lore,\n\nEach data point a star to be untied,\n\nTo find the patterns hidden deep in core.\n\n\n\n\nThrough clustering, we sort and classify,\n\nRegression leads us to predictive might,\n\nIn neural networks, connections amplify,\n\nEmerging knowledge, dazzling and bright.\n\n\n\n\nOh, data science, thou art a beacon rare,\n\nIlluminating paths to futures fair.\n\n\n\n\nWritten by ChatGPT - The AI Poet | 2023\n\n\n\n\nI would like to make it clear that I take no responsibility for any crimes against poetry committed here. - Liza"

Created with reprex v2.0.2
library(rvest)
some_html <- '<p dir="ltr" style="text-align:left;"></p><span style="font-size:0.9375rem;">The sentence starts this way,</span><span style="font-size:0.9375rem;"> </span><span style="font-size:0.9375rem;">then</span><span style="font-size:0.9375rem;"> </span><span style="font-size:0.9375rem;">spaces</span><span style="font-size:0.9375rem;"> </span><span style="font-size:0.9375rem;">disappear</span>'
html_text(read_html(some_html)) # is correct
#> [1] "The sentence starts this way, then spaces disappear"
html_text2(read_html(some_html)) # not correct
#> [1] "The sentence starts this way,thenspacesdisappear"

Created with reprex v2.0.2
There are several points that could matter here. Sensible comments include:
library(rvest). B & C do.
Marks for reasonable effort. It should be clear which task and which capability they are connecting.
Examples could be their summary for Black Saber in this assessment connecting to ‘4. Communication and Engagement’, as they are writing a summary for a non-stats audience, or it could also connect to ‘3. Solution Seeking’, since addressing issues of gender bias is part of “developing ‘solutions that shape and advance our futures’.”
echo = F!) a setup chunk at/near the beginning of your
submission. It should include all the required libraries, suppress
package loading messages, and not have any
install.packages() commands.
Remember to include references if you use AI, and you should reference the Graduate Capabilities document you use in the reflection and the ‘dos and don’ts’ article.
¹ This isn’t a real company.
² Hint: The interview scores aren’t based on AI. Of the previous phases, consider which of these parts of the pipeline might be most impacted by potential bias in the training data. E.g., GPA is just being read from the form so probably doesn’t have bias issues. You might be interested to know that some studies suggest people (specifically American voters, but may be more generalisable) prefer leaders with lower-pitched voices (https://doi.org/10.1371/journal.pone.0133779) and that Amazon had to scrap its AI recruitment system due to bias (https://www.businessinsider.com/amazon-ai-biased-against-women-no-surprise-sandra-wachter-2018-10).
³ Approximately 100 to 300 words.
⁴ We should always consider the ethics of web scraping. In this case, our target is my site, and I’ve set it up for you to scrape so we don’t have to do any other work.